Improving Persian-English Statistical Machine Translation: Experiments in Domain Adaptation
Abstract
This paper documents recent work on PeEn-SMT, our Statistical Machine Translation system for the English-Persian language pair. We give details of our previous SMT system and present our current development of significantly larger corpora. We explain how recent tests using these much larger corpora helped to evaluate problems in parallel corpus alignment and corpus content, and how matching the domains of PeEn-SMT’s components affects translation output. We then focus on combining corpora and approaches to improve test data, showing details of the experimental setup together with a number of experimental results and comparisons between them. We show how one combination of corpora gave us a metric score outperforming Google Translate for English-to-Persian translation. Finally, we outline areas of intended future work and how we plan to improve the performance of our system to achieve higher metric scores and, ultimately, to provide accurate, reliable language translation.
Similar resources
Use of linguistic features for improving English-Persian SMT
In this paper, we investigate the effects of using linguistic information to improve statistical machine translation for the English-Persian language pair. We choose POS tags as the supporting linguistic feature. A monolingual Persian corpus with POS tags is prepared, and the tag set is kept small. Using the POS tagger trained on this corpus, we apply a factored translation model. We al...
Improving English-Spanish Statistical Machine Translation: Experiments in Domain Adaptation, Sentence Paraphrasing, Tokenization, and Recasing
We describe the experiments of the UC Berkeley team on improving English-Spanish machine translation of news text, as part of the WMT’08 Shared Translation Task. We experiment with domain adaptation, combining a small in-domain news bi-text and a large out-of-domain one from the Europarl corpus, building two separate phrase translation models and two separate language models. We further add a t...
PersianSMT: A first attempt to English-Persian Statistical Machine Translation
In this paper, an attempt to develop a phrase-based statistical machine translation between English and Persian languages (PersianSMT) is described. Creation of the largest English-Persian parallel corpus yet presented by the use of movie subtitles is a part of this work. Two major goals are followed here: the first one is to show the main problems observed in the output of the PersianSMT syste...
Orthographic and Morphological Processing for Persian-to-English Statistical Machine Translation
In statistical machine translation, data sparsity is a challenging problem, especially for languages with rich morphology and inconsistent orthography, such as Persian. We show that orthographic preprocessing and morphological segmentation, of Persian verbs in particular, improve Persian-English translation quality by 1.9 BLEU points on a blind test set.
A corpus-based translation study on English-Persian verb phrase ellipsis
The present research is a descriptive corpus-based translation study aiming at pinpointing the patterns of translation into Persian when dealing with English Verb Phrase Ellipsis (VPE). After scrutiny of the strategies applied by Persian translators some regular patterns were drawn, with the exception that the observed translation behavior may be taken as advantageous information for improving ...